[Quant] Phase 1 (video-gen): ModelOpt FP8 for HunyuanVideo-1.5 and Wan2.2 #2924
lishunyang12 wants to merge 25 commits into
Conversation
Signed-off-by: roG0d <baonudesifeizhai@gmail.com>
…vllm-project#2920) Threads quant_config / prefix through HunyuanVideo15Attention, HunyuanVideo15TransformerBlock, and HunyuanVideo15Transformer3DModel so the modelopt FP8 adapter from vllm-project#2913 has somewhere to bind per-layer scales. Modulation, embeddings, proj_out stay raw nn.Linear (full precision). Signed-off-by: lishunyang <lishunyang12@163.com>
…eo-1.5 examples/quantization/quantize_hunyuanvideo_15_modelopt_fp8.py: Offline calibration helper that produces a ModelOpt FP8 diffusers checkpoint for HunyuanVideo-1.5. Calibrates with 8 video prompts x 10 denoising steps, skips precision-sensitive layers (modulation, embeddings, output proj, token refiner) matching the vllm-project#2728 / vllm-project#2795 pattern, disables MHA quantizers by default (HV-1.5 self-attention degrades visibly under FP8 - see vllm-project#2920). vllm_omni/model_executor/stage_configs/hunyuan_video_15_dit_fp8.yaml: Stage config for serving the calibrated checkpoint via vllm-omni. Auto-detects ModelOpt metadata from the checkpoint (uses vllm-project#2913's adapter). Signed-off-by: lishunyang <lishunyang12@163.com>
…ce_scale kwarg unsupported) HV-1.5's diffusers pipeline uses the new Guider abstraction (guider_config.json in the checkpoint) rather than a guidance_scale kwarg. Try setting it on the guider object once up front; in the per-prompt call, try with guidance_scale first and fall back without it on TypeError. Calibration only needs amax stats, so the exact CFG value isn't critical. Signed-off-by: lishunyang <lishunyang12@163.com>
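The fallback described in this commit can be sketched roughly as below. This is a minimal illustration, not the calibrator's actual code: `call_pipeline` and `fake_pipe` are hypothetical names, and a real pipeline call takes more arguments.

```python
def call_pipeline(pipe, prompt, guidance_scale=None, **kwargs):
    """Call `pipe`, retrying without guidance_scale if the kwarg is rejected."""
    if guidance_scale is not None:
        try:
            return pipe(prompt=prompt, guidance_scale=guidance_scale, **kwargs)
        except TypeError:
            # Pipelines built on a Guider object may not accept this kwarg.
            pass
    return pipe(prompt=prompt, **kwargs)


def fake_pipe(prompt, num_inference_steps=10):
    """Hypothetical stand-in for a pipeline without a guidance_scale kwarg."""
    return f"{prompt}:{num_inference_steps}"


result = call_pipeline(fake_pipe, "a dog", guidance_scale=6.0, num_inference_steps=10)
```

One caveat with this pattern: a TypeError raised inside the pipeline for an unrelated reason is also swallowed and triggers the retry, which is acceptable for calibration but would hide bugs elsewhere.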
Three checks: (A) transformer/config.json has sane quantization_config, (B) safetensors contain FP8 tensors, (C) optional disk-size delta vs BF16. Run after the quantize_*_modelopt_fp8.py scripts to spot issues before attempting to serve. Signed-off-by: lishunyang <lishunyang12@163.com>
…or view) torch's get_tensor() returns FP8 storage as bf16 views on some safetensors versions, giving false negatives. Read the on-disk dtype from the header directly — that's what actually determines whether the checkpoint is FP8. Signed-off-by: lishunyang <lishunyang12@163.com>
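Reading the on-disk dtype needs only the safetensors header: an 8-byte little-endian length followed by a JSON table of tensor entries. The sketch below shows one way to do that with the standard library; the function names are illustrative and are not taken from the verifier script.

```python
import json
import struct


def read_safetensors_dtypes(path):
    """Return {tensor_name: dtype_string} from the header, without loading tensors."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian u64 header size
        header = json.loads(f.read(header_len))
    return {
        name: info["dtype"]
        for name, info in header.items()
        if name != "__metadata__"  # optional free-form metadata entry
    }


def count_fp8_tensors(path, fp8_dtype="F8_E4M3"):
    """Count tensors whose stored dtype is the given FP8 dtype string."""
    return sum(1 for dtype in read_safetensors_dtypes(path).values() if dtype == fp8_dtype)
```

Because nothing is materialised as a tensor, the result cannot be affected by how a given loader chooses to view FP8 storage.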
The default export_hf_checkpoint() doesn't actually serialize weights as FP8 for unknown model types like HunyuanVideo15Transformer3DModel; it saves BF16 placeholders. The HunyuanImage-3 calibration helper hit the same bug. Three changes:
- Manually call modelopt.torch.export.unified_export_hf._export_quantized_weight per-module to convert in-memory tensors to actual FP8.
- Save the pipeline by hand (copy source minus transformer/, then save the quantized transformer with hide_quantizers_from_state_dict).
- Patch transformer/config.json to inject quant_algo: FP8 + config_groups so vllm-omni's adapter (vllm-project#2913) auto-detects it.
Signed-off-by: lishunyang <lishunyang12@163.com>
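The config-patching step might look roughly like the sketch below. It assumes that quant_algo and config_groups sit under a quantization_config key in transformer/config.json; that nesting and the empty config_groups placeholder are assumptions, and the function name is illustrative rather than the calibrator's own.

```python
import json
from pathlib import Path


def patch_quant_config(transformer_dir, quant_algo="FP8"):
    """Inject quantization metadata into <transformer_dir>/config.json."""
    config_path = Path(transformer_dir) / "config.json"
    config = json.loads(config_path.read_text())
    # Assumed layout: metadata nested under "quantization_config".
    quant_cfg = config.setdefault("quantization_config", {})
    quant_cfg["quant_algo"] = quant_algo
    # Placeholder only: the real config_groups content comes from the calibrator.
    quant_cfg.setdefault("config_groups", {})
    config_path.write_text(json.dumps(config, indent=2))
    return config
```

Using setdefault keeps any fields an earlier export step already wrote, so the patch only adds what is missing.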
…, not pipeline Diffusers pipelines are ConfigMixin, not nn.Module — they don't have .named_modules(). Pass pipe.transformer directly. Signed-off-by: lishunyang <lishunyang12@163.com>
…ation fp8, not --stage-configs-path Signed-off-by: lishunyang <lishunyang12@163.com>
…block
When --weight-block-size 'M,N' is given, override the weight quantizer with
block_sizes={-1: N, -2: M} so each linear gets a (out//M, in//N) scale tensor
instead of a scalar. Patched config_groups advertises strategy='block' +
block_structure='MxN' so consumers know what to expect.
Static FP8 is exempt from upstream vLLM's online block-wise gate, so this
just works at serving time via vllm-project#2913's adapter.
Default behavior unchanged (per-tensor) — pass --weight-block-size 128,128
to opt in.
Signed-off-by: lishunyang <lishunyang12@163.com>
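As a rough illustration of the mapping described in this commit, the sketch below turns an 'M,N' argument into the block_sizes override and the advertised metadata, and computes the scale shape a linear layer would end up with. The helper names are hypothetical and the dict is not ModelOpt's full quantizer config.

```python
def parse_block_size(arg):
    """Parse an 'M,N' string into a pair of positive ints (M, N)."""
    m, n = (int(part) for part in arg.split(","))
    if m <= 0 or n <= 0:
        raise ValueError(f"block sizes must be positive, got {arg!r}")
    return m, n


def block_quant_settings(arg):
    """Build the override and metadata fields named in the commit message."""
    m, n = parse_block_size(arg)
    return {
        "block_sizes": {-1: n, -2: m},  # last dim blocks by N, second-to-last by M
        "strategy": "block",
        "block_structure": f"{m}x{n}",
    }


def expected_scale_shape(out_features, in_features, m, n):
    """Scale tensor shape for a weight of shape (out_features, in_features)."""
    return (out_features // m, in_features // n)
```

For example, a 2048 x 2048 linear with `--weight-block-size 128,128` would get a 16 x 16 grid of scales instead of a single scalar.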
…s per-block) Reads shape info from safetensors header and classifies the checkpoint as per-tensor / per-channel / per-block based on whether weight_scale tensors are scalar, 1-D, or N-D. Helps verify --weight-block-size actually took effect (or if ModelOpt silently flattened to per-tensor). Signed-off-by: lishunyang <lishunyang12@163.com>
… granularity ModelOpt block-wise produces shapes like [16, 1, 16, 1] where size-1 dims are broadcasting axes. Classify by non-unity dim count: 0=per-tensor, 1=per-channel, 2+=per-block. Signed-off-by: lishunyang <lishunyang12@163.com>
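The classification rule in this commit is small enough to state as code. A minimal sketch, with an illustrative function name:

```python
def classify_scale_granularity(shape):
    """Classify a weight_scale shape by how many dimensions are larger than 1."""
    non_unity = sum(1 for dim in shape if dim != 1)
    if non_unity == 0:
        return "per-tensor"   # scalar, or all broadcasting axes
    if non_unity == 1:
        return "per-channel"
    return "per-block"
```

With this rule a shape such as [16, 1, 16, 1] counts as per-block, while [1] or an empty shape counts as per-tensor.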
…V-5B examples/quantization/quantize_wan2_2_modelopt_fp8.py: Offline calibration helper that produces a ModelOpt FP8 diffusers checkpoint for Wan2.2 TI2V-5B (the dense 5B variant that fits 80GB BF16). Same design as the HunyuanVideo-1.5 calibrator (vllm-project#2924): force-export FP8 weights, patch quant_algo: FP8 into config.json, hide quantizers during save. Skips Wan2.2's precision-sensitive layers (condition_embedder, patch_embedding, proj_out, scale_shift_table, SP helpers). MHA quantizers off by default. vllm_omni/model_executor/stage_configs/wan2_2_ti2v_dit_fp8.yaml: Stage config for serving the calibrated checkpoint via vllm-omni. Signed-off-by: lishunyang <lishunyang12@163.com>
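One simple way to express such a skip list is name-based wildcard matching, sketched below with the standard library. The layer names in the patterns come from the commit message; the wildcard form, the omission of the SP helpers (their names are not given here), and the example module names are assumptions, and this is not the calibrator's actual filter.

```python
from fnmatch import fnmatch

# Precision-sensitive Wan2.2 layers named in the commit message.
SKIP_PATTERNS = (
    "*condition_embedder*",
    "*patch_embedding*",
    "*proj_out*",
    "*scale_shift_table*",
)


def should_quantize(module_name, skip_patterns=SKIP_PATTERNS):
    """Return False for modules whose name matches any skip pattern."""
    return not any(fnmatch(module_name, pattern) for pattern in skip_patterns)
```

A name like `blocks.0.attn1.to_q` (hypothetical) would pass the filter, while `condition_embedder.time_embedder.linear_1` (also hypothetical) would be left in full precision.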
…ject#2920) Threads quant_config / prefix through WanSelfAttention, WanCrossAttention, WanFeedForward (+ ColumnParallelGELU), WanTransformerBlock, and WanTransformer3DModel / WanVACETransformer3DModel, plus the four pipelines (T2V / I2V / TI2V / VACE). Modulation (scale_shift_table), patch_embedding (Conv3d), time/text/image embedders, and proj_out stay full precision. All attention + FFN linears receive quant_config so the ModelOpt FP8 adapter from vllm-project#2913 can bind per-layer scales at load time. The aggressive skip patterns from vllm-project#2920 (attn1/attn2 quant_config=None) are NOT applied here — that was an online-FP8 quality workaround; static calibration handles it. Signed-off-by: lishunyang <lishunyang12@163.com>
…n.net_0) Wan2.2 ModelOpt FP8 checkpoint has diffusers-style dotted FFN names (ffn.net.0.proj, ffn.net.2) but vllm-omni's WanFeedForward uses underscored names (ffn.net_0.proj, ffn.net_2). The transformer's load_weights remaps these for .weight tensors, but the ModelOpt adapter resolves scale tensor names independently via WeightsMapper and was missing the remap, so all 120 FFN scale tensors (30 blocks x 2 linears x 2 scales) silently fell through, leaving FP8 weights with no valid scales at serving time (visible as pure noise output). Fix:
- Add hf_to_vllm_mapper class attribute on WanTransformer3DModel with the ffn remap.
- Extend ModelOptFp8CheckpointAdapter._get_weights_mapper to merge a model's hf_to_vllm_mapper (if present) into the resolution map. Models can now register arbitrary substring remaps via this standard vLLM attribute.
Signed-off-by: lishunyang <lishunyang12@163.com>
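The effect of the remap, and the count of 120 affected tensors, can be illustrated with plain string substitution. This sketch does not use vLLM's WeightsMapper; the `blocks.{i}` prefix and the `input_scale` suffix are assumptions about the checkpoint key layout.

```python
# Substring remaps from diffusers-style names to vllm-omni names.
FFN_REMAP = {
    ".ffn.net.0.": ".ffn.net_0.",
    ".ffn.net.2.": ".ffn.net_2.",
}


def remap_name(name, substr_map=FFN_REMAP):
    """Apply every substring replacement to a checkpoint key."""
    for old, new in substr_map.items():
        name = name.replace(old, new)
    return name


def ffn_scale_names(num_blocks=30):
    """Enumerate FFN scale keys: blocks x 2 linears x 2 scales (assumed layout)."""
    return [
        f"blocks.{i}.{linear}.{scale}"
        for i in range(num_blocks)
        for linear in ("ffn.net.0.proj", "ffn.net.2")
        for scale in ("weight_scale", "input_scale")
    ]


remapped = [remap_name(name) for name in ffn_scale_names()]
```

Without the remap none of these keys match a model parameter, which is consistent with the scales silently falling through.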
hsliuustc0106
left a comment
This PR is substantial (>1000 LOC / >10 files). Could you please run the L3 tests locally and paste the results here?
Once L3 test results are available, I will proceed with a full review of the ModelOpt FP8 video-gen implementation.
Helps diagnose name-mismatch between checkpoint keys and model parameters (e.g. diffusers .ffn.net.0. vs vllm-omni .ffn.net_0.). Signed-off-by: lishunyang <lishunyang12@163.com>
…t FP8 adapter The adapter is instantiated with the whole Pipeline, not just the DiT. Only checking the top-level model means hf_to_vllm_mapper defined on a sub-module (e.g. WanTransformer3DModel inside Wan22TI2VPipeline) was invisible. Walk named_modules() and aggregate any mappers found. Signed-off-by: lishunyang <lishunyang12@163.com>
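The submodule walk can be sketched without torch by using a tiny stand-in that exposes named_modules(). Everything here is illustrative: `Node` is a hypothetical substitute for nn.Module, and the mapper is a plain dict rather than vLLM's WeightsMapper object.

```python
class Node:
    """Minimal stand-in for a module tree with a named_modules() walk."""

    def __init__(self, children=None, hf_to_vllm_mapper=None):
        self.children = children or {}
        if hf_to_vllm_mapper is not None:
            self.hf_to_vllm_mapper = hf_to_vllm_mapper

    def named_modules(self, prefix=""):
        yield prefix, self
        for name, child in self.children.items():
            child_prefix = f"{prefix}.{name}" if prefix else name
            yield from child.named_modules(child_prefix)


def collect_substr_remaps(model):
    """Merge hf_to_vllm_mapper dicts found on the model or any submodule."""
    merged = {}
    for _, module in model.named_modules():
        mapper = getattr(module, "hf_to_vllm_mapper", None)
        if mapper:
            merged.update(mapper)
    return merged


# The pipeline itself defines no mapper; only its transformer does.
pipeline = Node(children={
    "transformer": Node(hf_to_vllm_mapper={".ffn.net.0.": ".ffn.net_0."}),
    "vae": Node(),
})
```

Checking only the top-level object would find nothing here, which is the failure mode the commit describes.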
Hi, just want to double-check: is the throughput mentioned here calculated directly as num_inference_step / wall_time? Does it cover the DiT model only, or does it include all encoders/decoders?
The average throughput can be read from the progress bar (tqdm) during the denoising steps, which does not include encoder/decoder processing time. If you want to confirm, check where the tqdm bar starts and ends and see whether the encoder and VAE run in between.
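The distinction discussed above can be made concrete with a small sketch that times only the denoising loop. The names are hypothetical; in practice the tqdm bar reports the same ratio.

```python
import time


def timed_denoise(step_fn, num_inference_steps):
    """Run the denoising loop and return wall time for the loop only."""
    start = time.perf_counter()
    for step in range(num_inference_steps):
        step_fn(step)  # one DiT forward plus scheduler update
    return time.perf_counter() - start


def throughput(num_inference_steps, wall_time):
    """Convert loop wall time into it/s and s/it."""
    return {
        "it_per_s": num_inference_steps / wall_time,
        "s_per_it": wall_time / num_inference_steps,
    }
```

Text encoding and VAE decoding happen outside `timed_denoise`, so they do not enter either number; an end-to-end figure would need a separate timer around the whole request.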
* update quant wan2-2 modelopt to support A14B model
* update wan2-2 modelopt quant script
* update two gpu quantization quality script
* update i2v modelopt quant script
* update hunyuanvideo and wan2.2 vace modelopt script
* add quantization config parsing image2video script
* update vae-use-tiling for quantization quality script to avoid cuda oom for bf16 model
* update vae-use-tiling for quantization quality script to avoid cuda oom for fp8 model (for vae)
* update quantization quality script to support i2v videogen task
* fix modelopt fp8 quantization script and quality script in T2V
* update per-block quant
* update vace videogen script
* update quantization quality script to support model load and throughput calculation, and rewrite quant_quality script to automate model offline quant
* fix quantization quality script in hunyuanvideo1.5
* update modelopt check script
* update remote transmission to bench_quant_videogen
* update check_quant_videogen
* update bench quant videogen script
* update quality bench scripts to add negative prompt to wan2.2 I2V
* update quality bench script for wan2.2 i2v
* update quality bench script to add denoise throughput (s/it)
* quant_quality script update for image gen model
* del unrelated scripts
* del unrelated scripts
* update recommended test cmd after quantization for wan models
Signed-off-by: ArtificialRay <shuaiweihuang@163.com>
…deo-1.5 video-gen
Rebuilt on current main and consolidates the two video-gen ModelOpt PRs (vllm-project#2924 FP8 + vllm-project#3305 NVFP4) into one. The inference side (ModelOpt FP8/NVFP4/mixed checkpoint adapters, generic get_checkpoint_adapter loader wiring, and quant_config threading + PP through the DiTs) already landed on main (vllm-project#2913, vllm-project#3570 and follow-ups), so those commits are dropped as redundant.
Net-new tooling that main does not have:
- examples/quantization/: offline ModelOpt FP8 + NVFP4 calibration for Wan2.2 (TI2V-5B + VACE) and HunyuanVideo-1.5, plus export verifier, activation variance diagnostic, and NVFP4 quant_config patch helper.
- stage_configs/{wan2_2_ti2v,hunyuan_video_15}_dit_fp8.yaml: DiT-only FP8 serve configs.
Scripts are self-contained (diffusers + modelopt); produced checkpoints load via main's existing ModelOpt adapter.
Signed-off-by: lishunyang12 <lishunyang12@163.com>
This will be covered by #3305.
Superseded — consolidating the video-gen ModelOpt work into a single PR. The inference side this PR carried (ModelOpt FP8 checkpoint adapter + diffusers loader wiring + DiT The net-new pieces have moved:
Purpose
Phase 1 of #2709 — extends ModelOpt FP8 support to video-gen models. #2913 covers Phase 1 for image-gen (Flux, Flux2-Klein, Qwen-Image, HunyuanImage-3); this PR adds the video-gen counterpart for both HunyuanVideo-1.5 and Wan2.2 TI2V-5B using the same loader infrastructure.
Builds on:
- quant_config wiring for HV-1.5 + Wan2.2 (extracted into this PR; [Quant] Wire quant_config through HunyuanVideo-1.5 and Wan2.2 DiT for online FP8 #2920 stays as online-FP8 ablation reference)
Changes
DiT wiring (extracted from #2920)
- hunyuan_video_15_transformer.py + pipelines: HunyuanVideo15Attention, HunyuanVideo15TransformerBlock, HunyuanVideo15Transformer3DModel accept quant_config / prefix; threaded to to_qkv, to_out[0], add_kv_proj, to_add_out, ff, ff_context.
- wan2_2_transformer.py + wan2_2_vace_transformer.py + 4 pipelines: WanSelfAttention, WanCrossAttention, WanFeedForward (+ ColumnParallelGELU), WanTransformerBlock, WanTransformer3DModel, VACE variant. Factories (create_transformer_from_config, create_vace_transformer_from_config) accept optional quant_config.
- Modulation (nn.Linear / scale_shift_table), patch embedders (Conv3d), time/text/image embedders, proj_out, and the HV-1.5 token refiner stay full precision.
- The aggressive skip patterns from #2920 (attn1 / attn2 quant_config=None on Wan2.2) are not applied here; that was an online-FP8 workaround, and static calibration handles it.
ModelOpt FP8 helpers
- examples/quantization/quantize_hunyuanvideo_15_modelopt_fp8.py: HV-1.5 calibrator. Force-exports FP8 weights, patches quant_algo: FP8, hides quantizers during save. MHA quantizers off by default.
- examples/quantization/quantize_wan2_2_modelopt_fp8.py: Wan2.2 TI2V-5B calibrator. Same design.
- examples/quantization/check_modelopt_fp8_export.py: verifier. Reads safetensors header dtypes, checks quant_algo: FP8, classifies scale granularity (per-tensor / per-channel / per-block).
- vllm_omni/model_executor/stage_configs/hunyuan_video_15_dit_fp8.yaml + wan2_2_ti2v_dit_fp8.yaml: serving stage configs with auto-detect.
Adapter (this PR also fixes a general-purpose bug in #2913's adapter):
- modelopt_fp8.py: _get_weights_mapper now walks submodules to aggregate hf_to_vllm_mapper from whichever sub-module defines it. The adapter is instantiated with the whole Pipeline, so model-specific remaps (like Wan2.2's ffn.net.0. → ffn.net_0.) must be discovered on the transformer submodule, not the top-level Pipeline. Fixes silent-noise output that occurred on Wan2.2 ModelOpt FP8 before this change.
- WanTransformer3DModel.hf_to_vllm_mapper added with that remap.
Both calibrators share --weight-block-size 'M,N' for block-wise FP8, and the same fallback pattern: _force_export_quantized_weights + _patch_quant_config + hide_quantizers_from_state_dict, because ModelOpt's export_hf_checkpoint doesn't handle diffusers-video checkpoints natively.
Validation: HunyuanVideo-1.5 (1×H100 80GB, T2V 480×832, 33 frames, 30 steps, seed=42)
torch.compile enabled (default).
Engine signals confirming the path is wired correctly:
- factory.py: Building quantization config: fp8 → Building quantization config: modelopt (auto-detect upgraded the user's --quantization fp8 flag to ModelOpt based on quant_algo: FP8 in transformer/config.json)
- data.py: Auto-detected quantization 'modelopt' from model config
- __init__.py: Selected CutlassFP8ScaledMMLinearKernel for ModelOptFp8LinearMethod (the ModelOpt FP8 kernel selected)
Visual comparison: HunyuanVideo-1.5
BF16 baseline:
hv15_bf16_compiled.mp4
ModelOpt FP8 (this PR):
hv15_modelopt_fp8_compiled.mp4
Same prompt ("A dog running across a field of golden wheat."), same seed, same sampling params. Output is BF16-equivalent: no detail collapse or composition drift like the online FP8 path showed in #2920.
Validation: Wan2.2 TI2V-5B (1×H100 80GB, T2V 704×1280, 49 frames, 30 steps, seed=42)
torch.compile enabled (default).
Engine signals:
- factory.py: Building quantization config: fp8 → modelopt (auto-detect fired)
- data.py: Auto-detected quantization 'modelopt' from model config
- __init__.py: Selected CutlassFP8ScaledMMLinearKernel for ModelOptFp8LinearMethod
- … weight_scale warnings after the hf_to_vllm_mapper fix for Wan2.2's ffn.net.0. → ffn.net_0. diffusers ↔ vllm-omni name remap.
Visual comparison: Wan2.2 TI2V-5B
BF16 baseline:
wan22_bf16_v4.mp4
ModelOpt FP8 (this PR):
wan22_modelopt_fp8_v4.mp4
Same prompt ("A dog running across a field of golden wheat."), same seed, same sampling params. Output is BF16-equivalent.
How to use
Pre-calibrated checkpoints are published on HF Hub so reviewers can test without recalibrating:
- shunyang90/HunyuanVideo-1.5-480p-ModelOpt-FP8
- shunyang90/Wan2.2-TI2V-5B-ModelOpt-FP8
Option A: use the published checkpoints (no calibration needed)
Option B: calibrate from BF16 yourself (reproducibility / custom prompts)
Test Plan
HunyuanVideo-1.5
- quant_algo: FP8, 648 F8_E4M3 tensors, per-tensor scale granularity
- … Auto-detected quantization 'modelopt')
Wan2.2 TI2V-5B
- quant_algo: FP8, 300 F8_E4M3 tensors, per-tensor scale granularity
- … hf_to_vllm_mapper fix, see adapter change below)
Both
- torch.compile enabled (default) on both BF16 and FP8 for fair comparison
Known limitations
- ModelOptFp8Config / ModelOptFp8LinearMethod only dispatches per-tensor scales: a block-wise checkpoint crashes at load with a shape-mismatch assertion in parameter.py:_assert_and_load. Per-tensor serving is the shippable path; --weight-block-size is kept in the calibrator for when upstream gains block-wise dispatch.
Follow-ups (still Phase 1, other video/variant coverage)
- … strategy: block … vllm-project-org/…
Depends on #2913. References #2920 (online-FP8 ablation reference, will not merge).
cc @baonudesifeizhai @hsliuustc0106 @ArtificialRay